Fault Tolerance

Review: write about fault tolerance problem being addressed, high level technique: replication (where and how), re-execution, etc.

Comments from review:

- questioning assumptions about trends (e.g. self configuring systems, hardware getting more reliable)

o QUESTION: What is really happening?

- how accurate is the data? It came from one system

o QUESTION: what do you expect?

Intro

Reliability: how long do you execute before a failure

i. MTTF

Availability: the probability that a request for service is satisfied

i. Availability = MTTF / (MTTF + MTTR)

ii. How do you achieve high availability?

1. Make MTTF big (highly reliable) or MTTR small (fast to repair)

iii. 99% ~3.65 days of downtime per year (the entries below are also downtime per year)

iv. 99.9% ~9 hours

v. 99.99% ~1 hour

vi. 99.999% ~5 minutes

vii. 99.9999% ~30 seconds
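
The availability formula and the downtime figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and a year is assumed to be 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime in one year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Print the downtime budget for two through six nines.
for level in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
    hours = downtime_hours_per_year(level)
    print(f"{level}: {hours:.4f} hours/year = {hours * 60:.2f} minutes/year")
```

Note that the formula is symmetric in effect: halving MTTR buys as much unavailability reduction as doubling MTTF, which is why fast repair is often the cheaper route.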

What is the cost of an hour of downtime (in 2002)?

i. Brokerage: $6,000,000

ii. eBay: $225,000

iii. Cell phone activation: $41,000

iv. Home shopping channel: $113,000

What is MTTF for a disk?

i. 900,000 hours (~100 years)

What is MTTF for an OS?

i. Windows 2000: 72 weeks

Failures

i. Terminology:

1. Fault = bug in code

2. Error = erroneous state as a result of executing code

a. Latent errors: the fault has executed but has not yet caused a failure

3. Failure = system does not act according to its specification

ii. Types

1. Bohr bugs / deterministic bugs:

a. Bugs that recur every time you do something – easily repeatable / predictable / can be tracked down and fixed / often found in testing

2. Heisenbugs / nondeterministic bugs

a. Bugs that don’t recur every time / caused by an unlikely combination of events / hard to reproduce and repair

iii. Causes of failure

1. Hardware (cpu, devices) – 18%

2. Environment (network, power) – 14%

3. Software (OS, applications) – 25%

4. Operations (maintenance, administration) – 42%

iv. When do failures occur?

1. Infant mortality – new, under tested

2. Normal lifetime – highly reliable

3. Wear-out period – (for HW) things break physically, or (for SW) assumptions about the world have changed too much

v. Failure models – why are they important?

1. Timing failures occur when a component violates timing constraints.

2. Output or response failures occur when a component outputs an incorrect value.

3. Omission failures occur when a component fails to produce an expected output.

4. Crash failures occur when the component stops producing any outputs.

5. Byzantine or arbitrary failures occur when any other behavior, including malicious behavior, occurs

vi. Synthetic failure models

1. Halt on failure

2. Failure status

3. Stable Storage

Approaches:

i. Fault Avoidance: make sure failures don’t happen

1. Fault prevention: write code without bugs

a. better languages

b. better software engineering

c. tool usage during coding process

d. e.g. write a new OS in a new language, prove properties of implementation

2. Fault removal: remove bugs from code

a. e.g. run testing tool (valgrind, purify)

b. windows static driver verifier – find bugs statically

3. Fault workaround: make sure the faulty code doesn’t execute

a. Firewall / virus detector

b. “It hurts when I run” -> “don’t run”

ii. Fault Tolerance

1. Allow failures to occur, but keep system running

2. Basic ideas:

a. Fault detection – figure out that something bad happened

b. Isolation – keep bad state from spreading to whole system

c. Recovery – get the bad part back into a good state

3. Basic approaches to error detection

a. Check dynamically for error conditions and inconsistencies to detect failures early

b. Use heart beats to make sure a module is still executing

c. QUESTION: how easy is it to do this generically?

i. QUESTION: as code evolves?

ii. QUESTION: at what cost?
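
Heartbeat-based detection can be sketched as a monitor that records the last time each module checked in and suspects any module that has been silent longer than a timeout. The class name, the timeout parameter, and the injectable clock are all assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from each module and flags silent ones."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the monitor can be tested with fake time
        self.last_seen = {}     # module name -> time of last heartbeat

    def beat(self, module: str) -> None:
        """Called by (or on behalf of) a module to say it is still executing."""
        self.last_seen[module] = self.clock()

    def suspected_failed(self) -> list:
        """Modules whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

A heartbeat only shows that the module is still running, not that it is producing correct output, so it detects crash and some timing failures but not response failures. The timeout also trades detection latency against false suspicion of slow modules.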

4. Basic approaches to isolation

a. Decompose into modules

i. Unit of failure is small

b. Check each module for errors

i. Fails fast – doesn’t spread corruption

ii. Isolate from other modules

c. Hardware / software boundaries around modules

i. Whole machine

ii. address space

iii. extra instructions
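
A fail-fast module with a boundary around it might look like the following sketch. The Inventory module, its redundant counter, and the call_isolated wrapper are hypothetical; in a real system the boundary would be an address space or a machine rather than a try/except.

```python
class ModuleFailure(Exception):
    """Raised when a module detects that its own state is inconsistent."""


class Inventory:
    """Toy module that re-checks its invariant around every operation."""

    def __init__(self):
        self.items = []
        self.count = 0  # redundant copy of len(items), kept as a consistency check

    def _check(self):
        if self.count != len(self.items):
            raise ModuleFailure("count does not match items")

    def add(self, item):
        self._check()            # fail fast: refuse to operate on corrupted state
        self.items.append(item)
        self.count += 1
        self._check()


def call_isolated(operation, *args):
    """Run a module operation and contain its failure at the module boundary."""
    try:
        return True, operation(*args)
    except ModuleFailure:
        return False, None       # caller sees a clean failure status, not corruption
```

Because the check runs before the update, a corrupted module stops immediately instead of writing further bad state that other modules might read.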

5. Basic approaches to recovery

a. Restore system to a functioning state

i. E.g. configure extra modules to take over for failed module, restart failed module

b. Forwards / Backwards

c. Concealing / revealing

d. Basic approaches:

i. Logging / retry

ii. Checkpoint / restore

iii. Replicate (process pairs)

iv. Alternate versions

v. Transactions (undo)

vi. Reveal faults up the stack
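
Checkpoint / restore combined with retry is a form of backward recovery: save a known-good copy of the state, attempt the operation, and roll back before retrying if it fails. A minimal sketch, assuming the state is a plain dict and that any exception counts as a detected failure (both are simplifications):

```python
import copy


def run_with_checkpoint(state: dict, operation, max_attempts: int = 3) -> int:
    """Apply operation(state) in place; on failure restore the checkpoint and retry.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        checkpoint = copy.deepcopy(state)   # checkpoint: known-good copy
        try:
            operation(state)
            return attempt
        except Exception:
            state.clear()                   # restore: discard partial updates
            state.update(checkpoint)
    raise RuntimeError("operation failed on every attempt")
```

Retrying the same code only helps with Heisenbugs and transient faults. A Bohr bug fails identically on every attempt, so this loop eventually reveals the fault to the caller.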

e. Concepts:

i. Have multiple Ys, Multiple Xs that are identical. Switch between Xs when Ys fail

1. Fault Tolerance

ii. Isolate X from Y so survival of X does not depend on Y

1. Fault Containment

2. Some useful things fail, but not all - partitioning

f. Redundancy: do things twice or more

i. On two machines

ii. In two processes

iii. In two places (state in memory / on disk checkpoint)

iv. At two times (e.g. checkpoint / restore)

v. QUESTION: what kinds of bugs are handled?

g. Diversity: do things multiple different ways

i. Different platforms

ii. Different implementations

iii. Idea: unlikely to have common failure modes

iv. Name: n-version programming, recovery blocks
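
The two diversity techniques named above can be sketched side by side. In n-version programming all versions run and a vote picks the answer; in a recovery block the alternates run one at a time until one passes an acceptance test. The function names and signatures here are illustrative assumptions, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter


def n_version_vote(versions, x):
    """Run every independent implementation and return the strict-majority answer."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                      # a crashed version simply loses its vote
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


def recovery_block(primary, alternates, acceptance_test, x):
    """Try the primary, then each alternate, until a result passes the acceptance test."""
    for candidate in [primary, *alternates]:
        try:
            result = candidate(x)
        except Exception:
            continue                  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no version produced an acceptable result")
```

Voting needs no acceptance test but pays for running every version every time. A recovery block runs extra versions only on failure but is only as good as its acceptance test.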

6. Basic questions for fault tolerance: where do you do the fault tolerance?

a. In the hardware (e.g. two processors, RAID with multiple disks)

b. Between the HW and the OS (e.g. virtual machine)

c. Within the OS

d. Between the OS and the application

e. Within the application

7. General principle:

a. If everything above layer X is identical, can tolerate faults at X or below automatically

i. E.g. FT unix -> HW, OS faults

ii. E.g. Hypervisor -> HW faults

iii. E.g. Nooks -> driver faults (everything else is above)

iv. E.g. Disco -> OS faults

b. If there is some diversity above X, can tolerate Heisenbugs above layer X

i. Process pairs – execute different streams

ii. Checkpoint / restart: if restart far enough back

Goals for fault tolerance

High performance

i. Not much additional cost over unreliable

Low cost

i. Not much additional hardware or software

Transparent to existing code

i. Can make existing programs / os more reliable

Tolerates lots of failures

i. Hardware

ii. Software

iii. human

(Gray) Approaches to Redundancy

Process pairs:

i. Run two copies

ii. Switch from one to the other on failure of one

How to use:

i. Lockstep processes – HW failures only

1. Both CPUs do the same work, so there is no extra capacity

ii. Explicit State checkpoints – do computation, send state changes to backup

1. Backup can do computations from latest state

iii. Automatic checkpoints – log messages

1. Inefficient – don’t know what to checkpoint, must send everything

iv. Delta checkpoints – send operations, not state

1. re-execute on other side. Reduces bandwidth

v. Persistent processes – only replicate persistent data and session existence, not transient per-session data (internal in-memory data structures)

1. Make state changes persistent: e.g. all on disk

2. On failure, backup wakes up knowing sessions but not state

3. On failure, internal state is in an unknown, possibly inconsistent condition
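
The delta-checkpoint variant of process pairs can be sketched as follows: the primary applies each operation and forwards the operation itself (not the resulting state) to the backup, which re-executes it. Both replicas live in one Python process here purely for illustration. In a real system they would sit on separate processors and the forwarding would be a message. All class and method names are assumptions.

```python
class Replica:
    """Holds a key -> integer state and applies (key, delta) operations."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        key, delta = op
        self.state[key] = self.state.get(key, 0) + delta


class ProcessPair:
    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()

    def execute(self, op):
        self.primary.apply(op)
        self.backup.apply(op)    # delta checkpoint: ship the operation, re-execute it

    def fail_over(self):
        """Primary has failed: promote the backup and start a fresh backup."""
        self.primary = self.backup
        self.backup = Replica()
        # Bring the new backup up to date with a one-time full state copy.
        self.backup.state = dict(self.primary.state)
```

Shipping operations is cheaper than shipping state, but re-execution must be deterministic. If the operation itself triggers a Bohr bug, the backup will hit the same bug.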

Transactions

i. Group of operations that form a consistent transformation of state - ACID

1. Atomic – all or nothing

2. Consistent – every transactional execution sees a correct picture of the state, even if other transactions are executing

3. Integrity – is a correct state transformation

4. Durable – the effects of a committed transaction survive even if a failure occurs after the commit

ii. Operations

1. Begin transaction

2. Commit – make effects durable

3. Abort – undo partial effects

iii. Use for fault tolerance

1. Allows use of persistent process pairs

a. Allows undo of actions in a transaction that aborted

b. Allows reset of system to known good state

iv. QUESTION: What is great about transactions?

1. Can reason about state of system with failures

v. QUESTION: Why not use transactions?

1. Programming cost

2. Performance cost – extra communication

vi. QUESTION: What is MTTR here?

1. Must detect failure

2. Backup must abort in-progress transactions

a. no state to sync or log to replay
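
The begin / commit / abort operations listed above can be sketched with an in-memory undo log. This illustrates only atomicity (all or nothing via undo). Durability and isolation are not modeled, and the Transaction class is a hypothetical illustration rather than any particular system's API.

```python
_MISSING = object()   # sentinel: the key did not exist before the transaction


class Transaction:
    """Begin by constructing; then write(), and finish with commit() or abort()."""

    def __init__(self, store: dict):
        self.store = store
        self.undo_log = []            # (key, previous value) in write order

    def write(self, key, value):
        self.undo_log.append((key, self.store.get(key, _MISSING)))
        self.store[key] = value

    def commit(self):
        self.undo_log.clear()         # effects stay; nothing left to undo

    def abort(self):
        # Undo in reverse order so the oldest saved value wins for each key.
        for key, old in reversed(self.undo_log):
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo_log.clear()
```

This is why recovery is simple for persistent process pairs: after a takeover the backup aborts any in-progress transactions, which returns the store to the last committed state without needing the failed primary's in-memory data.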

FT communication

i. Session abstraction

1. Sequenced

2. Retry on alternate path if path fails

3. Notify endpoints if all paths fail

4. Sessions handle switching to backup automatically if primary fails

5. On TX abort, sequence number reverts to beginning of transaction, intervening messages cancelled
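
The first three session properties (sequencing, retry on an alternate path, notification when all paths fail) can be sketched as below. A path is modeled as a callable that raises OSError when it is down. That modeling choice, and the class names, are assumptions made for illustration.

```python
class AllPathsFailed(Exception):
    """Raised to notify the endpoint that no path could deliver the message."""


class Session:
    def __init__(self, paths):
        self.paths = list(paths)   # each path: callable(seq, payload), raises OSError if down
        self.next_seq = 0

    def send(self, payload):
        """Deliver payload with the next sequence number, trying each path in order."""
        seq = self.next_seq
        for path in self.paths:
            try:
                path(seq, payload)
            except OSError:
                continue           # this path failed; retry on the next one
            self.next_seq += 1     # advance only after a successful delivery
            return seq
        raise AllPathsFailed(f"message {seq} could not be delivered")
```

Keeping the sequence number unchanged across retries lets the receiver discard duplicates if a path delivers the message but fails before acknowledging it.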

FT storage

i. Store on multiple disks

ii. Many replication options – take 739 for details

iii. Transactions + logs for ensuring storage updated consistently
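
Storing on multiple disks can be illustrated with simple mirroring: every write goes to all live disks, and a read is served by any live disk that holds the block. Disks are modeled as in-memory dicts, which is an assumption made for illustration. Real replication schemes (and the logging needed to keep replicas consistent across a crash mid-write) are more involved.

```python
class MirroredStore:
    def __init__(self, n_disks: int):
        self.disks = [{} for _ in range(n_disks)]   # None marks a failed disk

    def fail(self, index: int) -> None:
        """Simulate the loss of one disk."""
        self.disks[index] = None

    def write(self, block, data) -> None:
        live = [d for d in self.disks if d is not None]
        if not live:
            raise IOError("no live disks")
        for disk in live:
            disk[block] = data                      # same data on every live disk

    def read(self, block):
        for disk in self.disks:
            if disk is not None and block in disk:
                return disk[block]
        raise IOError(f"block {block!r} unavailable")
```

With n mirrored disks the data survives up to n - 1 disk failures, at the cost of n times the storage and write traffic.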